Abstract

Background: A wealth of genome sequences has provided thousands of genes of unknown function, but 
identification of functions for the large numbers of hypothetical genes in phytopathogens remains a challenge that 
impacts all research on plant-microbe interactions. Decades of research on the molecular basis of pathogenesis focused 
on a limited number of factors associated with long-known host-microbe interaction systems, providing limited 
direction into this challenge. Computational approaches to identify virulence genes often rely on two strategies: 
searching for sequence similarity to known host-microbe interaction factors from other organisms, and identifying islands 
of genes that discriminate between pathogens of one type and closely related non-pathogens or pathogens of a 
different type. The former is limited to known genes, excluding vast collections of genes of unknown function 
found in every genome. The latter lacks specificity, since many genes in genomic islands have little to do with 
host-interaction. 

Results: In this study, we developed a supervised machine learning approach designed to recognize
patterns in large and disparate data types in order to identify candidate host-microbe interaction factors. The
soft rot Enterobacteriaceae strains Dickeya dadantii 3937 and Pectobacterium carotovorum WPP14 were used for
development of this tool because these pathogens are important on multiple high-value crops
worldwide, and more genomic and functional data are available for the Enterobacteriaceae than for any other microbial
family. Our approach achieved greater than 90% precision and greater than 80% recall in 10-fold cross-validation tests.

Conclusion: Application of the learning scheme to the complete genome of these two organisms generated a list of 
roughly 200 candidates, many of which were previously not implicated in plant-microbe interaction and many of 
which are of completely unknown function. These lists provide new targets for experimental validation and 
further characterization, and our approach presents a promising pattern-learning scheme that can be 
generalized to create a resource to study host-microbe interactions in other bacterial phytopathogens. 
Background

Interactions between plant-associated microbes and their 
eukaryotic hosts are complex biological processes involving
hundreds, if not thousands, of genes from each organism.
Understanding the molecular mechanisms of such complex
processes at the systems scale is seriously hampered by the
lack of a comprehensive list of contributing gene products
for even a single bacterial or fungal pathogen. Variation in
lifestyles and pathogenic potential between organisms makes
the challenge all the
greater. Genome sequencing has dramatically increased 
the potential for large-scale screens to identify genes in- 
volved with host-microbe interactions. Direct experimen- 
tal evidence is the obvious gold standard, but not all 
significant pathogens are experimentally tractable, and se- 
lection of experimental conditions and convenient hosts 
for high-throughput screens can limit discovery. More tar- 
geted experiments can be designed to probe function 
more completely, but these are time consuming and gen- 
erally limited to a smaller number of candidate genes. 
Further, it is unclear what experiments to conduct if a 
candidate gene is of completely unknown function. Im- 
portantly, genes of unknown function make up a sub- 
stantial fraction of each sequenced genome, and it is 
likely that among these lie some of the greatest poten- 
tial for discovery of truly novel aspects of host-microbe 
interaction (as well as many other complex biological 
processes). 

Computational approaches to identify potential host- 
microbe interaction factors and predict their specific 
functions can be a valuable way to guide experimenta- 
tion, and may be the only option for some recalcitrant 
organisms. Typical bioinformatics strategies include search- 
ing for sequence similarity to gene products known to con- 
tribute to host-microbe interaction in other organisms, and 
comparing genomes to identify gene islands that discrimin- 
ate between pathogens of one type and closely related non- 
pathogens or pathogens of a different type. Both strategies 
are useful, but the former is limited to known genes and de- 
tectable levels of sequence similarity, and thus excludes the 
vast collections of genes of unknown function. The latter 
lacks specificity, since many genes in genomic islands may 
have little to do with host interactions, and the definition of 
rules for the distribution across organisms can be arbitrary. 
There are no simple rules to define the relevant distribution 
for the set of orthologous genes across genomes, especially 
when there are a large number of genomes being compared. 
Further, it is preferable in many situations to factor in other 
features such as genome context or gene expression data as 
additional evidence sources to predict whether a gene is as- 
sociated with host-microbe interaction processes. 

More sophisticated computational prediction strategies can
introduce a variety of other types of evidence, but integration
of diverse data types remains a challenge. Machine learning
techniques are well suited to pattern recognition tasks that
combine diverse biological data sources into a single predictive
analysis, and they can achieve better performance than any
individual type of data, especially where (1) data sets are
large, (2) data come from heterogeneous sources, and (3)
patterns are not easily described by a compact set of rules.
All three conditions hold for the task of genome-scale
identification of host-microbe interaction factors. Supervised
machine learning schemes have been receiving increasing
attention recently as a promising approach to
study diverse biomedical problems [1-5], but no previous 
study focused on host-microbe interaction factors. In this 
study, we developed a supervised machine learning strategy
to identify the gene inventory involved with host-microbe
interaction from two soft rot-associated enterobacteria,
Dickeya dadantii (also known as Erwinia chrysanthemi) 3937 [6]
and Pectobacterium carotovorum (also known as Erwinia
carotovora) WPP14 [7]. Our approach allows us to incorporate
a wide variety of input data, including homology information,
genome context, predicted transcription factor binding sites,
and microarray transcript profiles. It achieved promising
results, with precision over 90% and recall over 80%. Further, our study
generates an extended list of roughly 200 candidate 
interaction factors and provides experimentally test- 
able hypotheses to stimulate further research on the mo- 
lecular mechanisms of soft rot pathogenesis and survival 
in plant hosts. This study represents a promising applica- 
tion of pattern-recognition methods for identification 
of factors involved in complex biological processes, 
which can be generalized to study other plant-associated 
organisms. 

Methods 

Target genome selection 

Soft rot-associated enterobacteria are economically im- 
portant pathogens that infect a broad range of plant spe- 
cies [8-11]. Soft rot bacterial pathogenesis is characterized 
by rapid necrosis of parenchymatous tissues, mainly due 
to the action of secreted enzymes that degrade the middle 
lamellae and the primary cell wall [12]. Continuing discovery
of additional genes that are involved in survival in a plant
host or that contribute directly to pathogenesis [13-19]
suggests that, even for well-studied organisms such as Dickeya
dadantii 3937, we have not yet achieved a comprehensive list of
host-microbe interaction factors or a complete understanding of
their precise roles. In this study, we target two soft
rot-associated phytopathogens for genome-wide identification of
host interaction factors (Table 1). One, Dickeya dadantii 3937
(Dd3937), was originally isolated from Saintpaulia ionantha
[20,21] and is a long-standing model system for this group of
organisms [6]; the other, Pectobacterium carotovorum subsp.
carotovorum WPP14 (WPP14), was isolated from infected potato in
Wisconsin [7,22].
Table 1 Genome-wide target class label assignment to each protein coding gene as a data point for Dickeya dadantii
3937 and Pectobacterium carotovorum WPP14

          Total # CDS*   IF**   CF**   Training data set   Testing data set   Pseudogene
Dd3937    4520           267    1264   1531                2989               28
WPP14     4590           233    1111   1344                3246               174

*Only protein coding genes are used; pseudogenes are not included.

**IF stands for host-microbe interaction factor; CF stands for genes involved in core biological processes.



Colonization and survival in plants requires numerous 
factors including proteins involved with iron assimila- 
tion, protein secretion, exopolysaccharide synthesis, mo- 
tility, and stress-resistance [23,24]. Five Gene Ontology 
terms were identified that partition the majority of the 
positive class training set data into distinct aspects of 
host-microbe interactions (Table 2). We included all data 
points in most of our analyses, but also conducted analyses
on the partitions defined by these GO annotations
(Additional file 1a and 1b). This allows us to test whether
different subsystems contain distinct patterns that can be 
recognized by our learning schemes, while avoiding sub- 
systems with too few genes to provide sufficient informa- 
tion to train the learning schemes. 

Assembling training datasets 

The data set for each target genome is assembled separ- 
ately. Genome sequences, predicted proteins and anno- 
tations for both genomes were obtained from the ASAP 
database [25,26]. Each protein-coding gene in a target 
genome is considered a data point. The target class label 
in this specific learning task indicates whether or not a 
data point has an association with the biological pro- 
cesses involved in host-microbe interaction. A positive 
class label means the data point is related to host- 
microbe interaction. A negative class label indicates that the
data point is not likely to be directly involved in host-microbe
interaction; rather, it is associated with core biological
processes such as transcription and translation or central
pathways of metabolism. Positive and negative class labels were
assigned by human experts.

Table 2 Ontology for host-microbe interaction, and category
assignment genome-wide for data points in Dickeya dadantii
(Dd3937) and Pectobacterium carotovorum (WPP14)

GO term and name                                             Dd3937   WPP14
GO:0052192 movement in environment of other organism
  involved in symbiotic interaction                          41       41
GO:0052048 interaction with host via secreted substance
  involved in symbiotic interaction                          54       53
GO:0051816 acquisition of nutrients from other organism
  during symbiotic interaction                               103      81
GO:0044413 avoidance of host defenses                        43       34
GO:0043903 regulation of symbiosis, encompassing
  mutualism through parasitism                               13       9
*GO:0044403 symbiosis, encompassing mutualism
  through parasitism                                         13       15
Total                                                        267      233

*This term is a parent term for all others listed in this table and is used as a
generic catch-all for host-microbe interaction factors lacking more specific GO
term annotations.

For each data point, we assemble a vector of features 
(or attributes), to characterize it. In our preliminary ana- 
lyses, we sought to be inclusive in construction of the 
data matrix. We included 606 attributes for Dd3937 and 
598 attributes for WPP14, and these attributes fall 
roughly into four different categories listed in Table 3. 
1) Sequence homology data were obtained from BLASTP
searches of the proteins from the target genomes against
239 gamma-proteobacteria from 14 bacterial orders and
58 genomes from other bacterial families outside of
gamma-proteobacteria (details in Additional file 2a and
2b). 2) We further summarized sequence homology
information by classifying organisms based on phenotypes
(e.g., strict anaerobe), taxonomy (e.g., the order
Enterobacteriales), habitat (e.g., aquatic), and host type
(e.g., plant-associated). Based on this information, we
calculated a series of attributes summarizing the hom- 
ology data. For instance, for each gene, we calculate 
the number of genomes with a homolog, the fraction 
of genomes with homologs that are plant-associated, 
the average similarity scores between homologs, the 
ratio of the similarity score of plant-associated versus 
animal-associated homologs, the percentage of hits in 
the order of Enterobacteriales, and the percentage in 
facultative anaerobic organisms, etc. Additional file 2c 
shows the number of genomes in each category used 
to generate summary attributes. 3) Information related 
to function and regulation including transcriptome and 
proteome profiles was incorporated into the attribute vec- 
tors (details in Additional file 3a), including microarray 
experiments with a pecS mutant strain [27], exposure to 
phenolic acids [28], and growth on potato tuber and stem 
[29]. For Dd3937, we also integrated the presence of pre- 
dicted binding sites for 32 transcriptional regulators, in- 
cluding ones related to gene regulation during infection 
such as PecS [17,27], KdgR [30], H-NS [31,32], and CRP 
[33,34]. We did not include binding site data for WPP14
because the large number of contigs complicates prediction.
4) Finally, we incorporated over 20 basic gene or protein
features (Table 3), such as GC content, amino acid composition
and computed structural and physicochemical features of
proteins and peptides [35], operon prediction [36], COG
functional category [37], and codon adaptation index [38,39].
Other gene features are derived from more complex analyses,
including: (a) the phylogenetic profile method [40], which is
based on the theoretical framework that co-occurrence of
functionally linked proteins will be preserved by natural
selection [41]; (b) phylogenetic conservation, which classifies
genes according to distribution at different branching depths
based on our phylogenetic framework for enterobacteria [11];
(c) PSORTb v3.0 [42], which predicts localization as
cytoplasmic, cytoplasmic membrane, periplasmic, extracellular,
or unknown; (d) protein fingerprint scanning (a similarity
search technique able to identify distantly related proteins)
against identified fingerprints associated with virulence
factors in the PRINTS database [43,44]; and (e) the gene
neighbor method, which identifies gene physical adjacency on a
chromosome [45], based on the theory that neutral evolution
tends to shuffle gene orders while functionally associated
genes have conserved gene order. We employ both 150 bp and
300 bp as threshold distances to define gene neighbors, using
ad hoc code.

Table 3 List of all attribute categories used in data set formation in this study, and number of attributes in each
category for all data points in the training data set for Dickeya dadantii (Dd3937) and Pectobacterium carotovorum
(WPP14)

Category                 Subcategory                                Dd3937   WPP14   Reference
Sequence homology        Subtotal                                   297      297
                         Gamma strains                              239      239     Additional file 2a
                         Non-gamma strains                          58       58      Additional file 2b
Phenotypes of interest   Subtotal                                   194      194
                         Taxonomy statistics                        76       76      Additional file 2c, d
                         Lifestyle statistics                       118      118     Additional file 2c, d
Gene characteristics     Subtotal                                   23       21
                         GC content                                 1        1       This study
                         Subcellular localization                   1        1       [42,46]
                         Phylogenetic profile                       6        6       [40,41]
                         Fingerprints scanning                      3        3       [43,44]
                         Codon adaptation index (CAI)               3        3       [47,48]
                         Physical adjacency (gene neighbor)         2        2       [49,50]
                         Operon prediction                          1        1       [36,51]
                         Phylogenetic conservation                  1        1       This study
                         COG functional category                    1        1       [52]
                         Genomic island                             4        1       [53,54]
                         Computed structural and physicochemical
                           features of proteins and peptides        40       66      [35,55]
Functional genomics      Subtotal                                   52       3
                         Binding site prediction                    32       0       Additional file 3b
                         Gene expression                            14       3       Additional file 3a
                         Proteomics                                 6        0       Additional file 3a
Total                                                               606      581
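
As an illustration of the gene neighbor attribute, the following minimal Python sketch flags adjacent genes whose intergenic gap falls within a threshold distance. It is not the ad hoc code used in this study: the Gene record, the gene_neighbors function, the definition of the gap, and the toy coordinates are all assumptions made for the example.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (coordinates are 1-based, inclusive)."""
    gene_id: str
    start: int
    end: int


def gene_neighbors(genes, max_gap):
    """Return pairs of adjacent genes separated by at most max_gap bp.

    Assumption of this sketch: all genes lie on one replicon, and the gap
    is the number of bases between the end of one gene and the start of
    the next (negative when the genes overlap).
    """
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1
        if gap <= max_gap:
            pairs.append((left.gene_id, right.gene_id))
    return pairs


# Toy coordinates, invented for illustration only.
genes = [
    Gene("g1", 100, 900),
    Gene("g2", 1000, 1800),  # 99 bp after g1
    Gene("g3", 2050, 2900),  # 249 bp after g2
    Gene("g4", 3500, 4000),  # 599 bp after g3
]
print(gene_neighbors(genes, 150))
print(gene_neighbors(genes, 300))
```

Running the function at both thresholds yields two attributes per gene, one stricter and one more permissive, which mirrors the use of two threshold distances in the text.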



Overview of supervised machine learning procedures 

The learning procedure is illustrated in Figure 1. (1) First 
training and testing data sets are assembled by assigning 
target class labels and forming attribute vectors. (2) Data 
preprocessing is performed to improve representation 
and quality, including attribute selection and data trans- 
formation, as well as data partitioning according to GO 
annotations. (3) Both data preprocessing and pattern 
learning schemes were implemented in Weka package 
version 3.5.6. [56,57]. Both base and ensemble classifiers 
were trained to recognize classification patterns. Seven 
base classifiers were employed in this study including 
decision tree [58], support vector machine (SVM) using 
sequential minimal optimization [59-61], Bayesian probabilistic
approaches including Bayesian network [62,63] and naive Bayes
[64], the instance-based learner k-nearest neighbor [65], and a
propositional rule learner using repeated incremental pruning
to produce error reduction (RIPPER) [66]. On top of base
classifiers, ensemble classifiers, such as bagging and boosting
classifiers, combine multiple models either by sub-sampling a
given dataset to achieve greater predictive accuracy and reduce
overfitting bias [67-70] or by combining probability estimates
from different methods [71-73]. Detailed algorithm descriptions
and specific settings are described in Additional file 4.
(4) Classifier training is followed by classifier performance
evaluation, comparison, and selection. Cross-validation is a
technique to assess how accurately a predictive model will
perform on an independent data set and whether the model
recognizes a pattern that is generalized enough to apply to
unseen data [74,75]. (5) Based on performance on the training
set, we selected the best classifiers to build models and make
predictions for the genes that were not part of the training
sets.

Figure 1 Flow chart of the procedures in performing supervised machine learning tasks of host-microbe interaction factor prediction.
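
The cross-validation idea in step (4) can be sketched in a few lines of Python. This is a generic, unstratified k-fold split written for illustration, not the Weka implementation used in this study; the function name k_fold_indices and the seed parameter are assumptions of the sketch.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [idx for f, fold in enumerate(folds)
                     if f != held_out for idx in fold]
        yield train_idx, test_idx


# Example: the 1531 labeled Dd3937 genes of Table 1 (267 IF + 1264 CF).
splits = list(k_fold_indices(1531, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Each gene is held out exactly once, so every performance estimate comes from data the classifier did not see during training. A stratified split, which keeps the proportion of positive and negative labels similar in every fold, may be preferable when the positive class is small.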

Data preprocessing 

Attribute extraction, or data transformation, was used to
improve the representation of the data sets. Data
transformation techniques create extracted attributes from the
original attributes in order to: normalize attributes so that
they are on approximately the same scale; standardize all
numeric attributes in the data set to zero mean and unit
variance [76]; map the data linearly to a lower dimensional
space in such a way that the variance of the data is maximized,
using principal components analysis (PCA) [77]; or combine
attributes where the aggregate feature is more useful than
keeping them separate. Since many attributes used in our
analysis are continuous data, we also employed data
discretization techniques that convert continuous features to
discretized or nominal ones to accommodate both data types in
the same analysis [78,79]. Another important component of data
preprocessing is attribute selection, which is the removal of
uninformative data, since excessive dimensionality can reduce
the effectiveness of learning tasks.
It includes two steps: an initial clean-up step where the 
attributes of each type (as listed in Table 3) are tested 
individually in order to remove the ones with insignifi- 
cant contribution to classification, which is especially 
useful for the data types with highest dimensionality. 
The second step is to evaluate the importance of an at- 
tribute passed on from the initial step, and to remove 
the ones with low importance measurement scores. 
We used random forest attribute importance measures 
in this step, which are based on the decrease of classifier 
performance when values of a variable in a bifurcating tree 
node are permuted randomly [80], as implemented in the
extended version of Weka 3.5.1 [81,82] (more details in
Additional file 4). Furthermore, we performed data decay
analysis to define compact attribute sets that maintain in- 
formativeness. This involved ranking all attributes based 
on importance measures from 100 runs using random for- 
est classifiers, gradually decreasing the number of attri- 
butes by window size 10 based on their rank, recording 
the performance of all decayed data sets, and defining the 
essential set as the point where the overall performance 
score began to drop. 
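
The data decay procedure can be sketched as a simple loop over progressively smaller attribute sets. The Python below is a minimal illustration under stated assumptions: data_decay, toy_score, and the tolerance parameter are hypothetical names and choices made for this example, and the scoring function stands in for a cross-validated classifier evaluation.

```python
def data_decay(ranked_attributes, evaluate, window=10, tolerance=0.01):
    """Shrink a ranked attribute list and locate where performance drops.

    ranked_attributes: attribute names, most important first.
    evaluate: callable mapping a list of attributes to a performance score.
    window: number of lowest-ranked attributes removed at each step.
    tolerance: assumed allowance below the full-set score before we call
        it a drop (the source does not state a numeric criterion).
    """
    history = []
    n_kept = len(ranked_attributes)
    while n_kept > 0:
        history.append((n_kept, evaluate(ranked_attributes[:n_kept])))
        n_kept -= window
    baseline = history[0][1]
    essential_size = history[0][0]
    for size, score in history:
        if score < baseline - tolerance:
            break  # performance began to drop; keep the previous size
        essential_size = size
    return essential_size, history


# Toy stand-in for a classifier score: only the top 25 attributes matter.
attributes = [f"attr_{i:02d}" for i in range(60)]


def toy_score(subset):
    return min(len(subset), 25) / 25


essential_size, history = data_decay(attributes, toy_score)
print(essential_size, history)
```

In the toy example the score is flat until fewer than 25 attributes remain, so the loop settles on the smallest tested set that still contains all informative attributes.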

Evaluating the performance of different learning schemes 

We used 10-fold cross-validation analyses to evaluate 
the learned classifiers on random subsets of data with- 
held from the training sets and averaged across multiple 
replicates. We recorded a variety of performance statis- 
tics for each run including accuracy, true positive rate 
(TPR or recall), and precision for the positive target class. 
We also used ROC (Receiver Operating Characteristic) 
curves, PR (Precision-Recall) curves, and the AUC (area
under the curve) to evaluate the performance of each con- 
structed classifier. In this particular learning task, we value 
precision rate as the most important statistic. Precision 
specifies the proportion of relevant objects being retrieved 
among all retrieved ones, a factor that is particularly im- 
portant to define a candidate list with high confidence for 
downstream experimental validation. On the other hand, 
recall is the proportion of relevant objects that are re- 
trieved. When a situation does not allow both precision 
and recall rates to be high at the same time, we give the 
precision rate precedence over the recall rate. ROC and 
PR curves are widely regarded as more appropriate than 
any individual statistic in evaluating classification
algorithms [83]. A ROC curve plots the true positive rate
against the false positive rate as the decision threshold is
varied, in order to characterize the tradeoff between the two
rates. PR curves plot precision (how precisely the algorithm
identifies the data points in a class) against recall (how many
of the "true" data points are retrieved) and provide a good
complement to ROC curves, which can be overly optimistic [84].
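
To make the performance statistics concrete, the following Python sketch computes them from predicted labels and scores. The function names and the toy labels are assumptions made for the example; the AUC is computed with the standard rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which may differ in detail from the software used in this study.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(y_true, y_pred):
    """Confusion-matrix statistics for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs),
        "f_measure": f_measure(precision, recall),
    }


def roc_auc(y_true, scores):
    """AUC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.05]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

As a check against reported values, f_measure(0.93, 0.81) evaluates to about 0.87, which agrees with the F-measure listed for the Dd3937 Random Forest classifier in Table 4.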

Results and discussion 

Many computational methods have been used to identify 
gene functions involved in host-microbe interaction, and 
most of them rely primarily on homology-based searches 
using known interaction determinants as bait to identify 
new candidate genes. These methods are often success- 
ful, but neglect many genes of unknown function and 
strain/clade-specific genes, which could play an important
role in host-microbe interactions and bacterial niche
adaptations [85,86]. Overcoming these limitations with
the current methodologies is critical to expanding our 
understanding of the complex molecular mechanisms 
underlying host-microbe interactions. The value of ma- 
chine learning not only lies in deriving knowledge based 
on pattern recognition, but also providing an automated 
alternative to having a human expert repeatedly sift through 
large and complex datasets. 

Some attributes are more useful than others to predict 
host-microbe interaction 

Our results indicate that although all categories outperform 
randomized data, different major categories of attributes 
contribute differently to learning scheme performance as 
shown in the ROC curve for Dd3937 in Figure 2 and 
Additional file 4. Gene features and summarized hom- 
ology information were most useful in classifying host- 
microbe interaction factors, while data related to computed 
structural or physicochemical characteristics, and gene
functionality data, including gene expression, binding 
site predictions, and proteomics profiles, performed less 
well. Further analysis of the gene functionality attributes 
using random forest importance measurement scores in- 
dicates that the data corresponding to many of these attri- 
butes are relatively noisy and do not correlate well with 
the target class, though a subset, such as KdgR binding 
site predictions, do correlate well. Some of our attributes 
are themselves the results of other pattern recognition 
methods. For example, phylogenetic profiles, one of the 
most useful attributes, are based on an unsupervised 
learning approach, where no prior information is given 
to the learner regarding the output or class label. Our
analysis is a good example of how supervised and unsupervised
learning algorithms can be combined to make better inference.

Figure 2 ROC curve to compare classifier performance of different data sets containing various types of attributes as listed in Table 3.
(TPR: True Positive Rate; FPR: False Positive Rate).

We conducted data decay analysis to obtain additional 
insight into the most informative attributes. The size of 
the final compact attribute sets is 45 and 31 for Dd3937 
and WPP14, respectively, as shown in Additional file 5b. 
The majority of attributes in the compact sets are summaries
of homology data according to phenotypes or
computed gene features, and many of the retained attri- 
butes are shared between both strains despite the inde- 
pendent machine learning analyses. The common list 
includes five gene feature attributes including phylo- 
genetic profile, gene cluster from operon prediction, gene 
neighbor, cellular localization, and amino acid compos- 
ition. The most informative homology attributes include 
percentage, average value, or sum value of a given gene 
having homologous hits with organisms having different 
pathogenicity and habitat phenotypes. In addition, the 
homology data summarized by phenotypes related to 
growth condition and taxonomic groups is also inform- 
ative including having homologs in anaerobic organisms, 
facultative anaerobes and their ratio, and having homologs 
in other gamma-proteobacteria, and enterobacteria, all of 
which appear in the selected attribute list for both strains. 
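
The homology summary attributes discussed above are straightforward to derive once BLASTP hits and genome phenotype labels are available. The Python sketch below shows one possible derivation for a single gene; the function name, the dictionary layout, the phenotype label strings, and the toy scores are all assumptions made for the example and do not reproduce the exact attribute definitions used in this study.

```python
def summarize_homology(hits, genome_traits):
    """Summarize one gene's homolog hits across reference genomes.

    hits: {genome_id: similarity score} for genomes that contain a homolog.
    genome_traits: {genome_id: set of phenotype labels}.
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    plant = [s for g, s in hits.items()
             if "plant-associated" in genome_traits[g]]
    animal = [s for g, s in hits.items()
              if "animal-associated" in genome_traits[g]]
    n_hits = len(hits)
    return {
        "n_genomes_with_homolog": n_hits,
        "fraction_plant_associated": len(plant) / n_hits if n_hits else 0.0,
        "mean_similarity": mean(list(hits.values())),
        # Ratio is undefined when there is no animal-associated homolog.
        "plant_vs_animal_similarity":
            mean(plant) / mean(animal) if mean(animal) else None,
    }


# Hypothetical genomes and scores, for illustration only.
genome_traits = {
    "genomeA": {"plant-associated"},
    "genomeB": {"plant-associated", "aquatic"},
    "genomeC": {"animal-associated"},
    "genomeD": {"aquatic"},
}
hits = {"genomeA": 80.0, "genomeB": 60.0, "genomeC": 35.0}
summary = summarize_homology(hits, genome_traits)
print(summary)
```

Because each summary is a single number per gene, hundreds of per-genome BLASTP columns collapse into a handful of attributes, which is consistent with the compact attribute sets retained by the data decay analysis.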

Overall, these results suggest that attributes which are 
relatively simple to assemble from standard BLASTP 
searches, coupled with a handful of additional easily
computed features, are sufficient to achieve good performance
in this machine learning task. This is particularly
encouraging for development of a generalized approach for
future applications to predict host-interaction factors across
a broad range of bacterial phytopathogens. 

Preprocessing and partitioning can improve performance 

The PR curve shown in Figure 3 illustrates the improve- 
ment in performance that we achieved through attribute 
selection, data discretization, and data partitioning ac- 
cording to GO terms. 1) Attribute selection generates 
more cost-effective learning schemes by reducing data set 
dimensionality by removing uninformative attributes, in 
order to improve the overall performance of the learning 
schemes [87,88]. After benchmarking different attribute 
selection techniques such as filter (e.g., subset attribute se- 
lection [89]) and wrapper methods (e.g., Naive Bayes with 
forward selection algorithm) as well as attribute ranking 
(e.g., SVM Attribute evaluator [90] and information gain), 
we chose random forest importance measures in this
study because they are robust to noise, relatively
computationally efficient, and suitable for data sets with high
dimensionality, hence reducing the risk of overfitting [81]. After
feature selection, our data sets contain 105 and 122 attri- 
butes, which are 17.3% and 21% of the original data size of 
Dd3937 and WPP14, respectively. 2) By comparing differ- 
ent data transformation techniques (Additional file 6a), 
supervised data discretization was shown to be substan- 
tially better for improving classifier performance than 
other methods. Supervised discretization techniques are
suitable for high dimensional data as they significantly
reduce the number of possible values of continuous features,
and also discretize an attribute according to its class label
[91,92]. 3) We also saw an improvement when we coupled the
preprocessing with partitioning the learning task into several
separate tasks based on assigning genes in the training set
according to GO terms. This result suggests that some
subsystems, such as localization in host and secretion of host
interaction proteins, are substantially more informative and
suitable for our learning task (Additional file 6b). Other
subsystems, such as interaction with host defense systems and
transcriptional regulation of host interaction genes,
performed less convincingly, possibly because these subsystems
are involved in host-microbe interaction but also include other
genes not implicated in this biological process. For example,
the global DNA-binding regulator gene hns also modulates
flagella genes and lipopolysaccharide production, which are
important for initial bacterial attachment to host cell
surfaces [93,94]. These data points were removed from
subsequent analysis. Our results suggest that our learning
schemes hold predictive power for the subsystems involved with
complex biological processes during host-microbe interaction,
but do not accurately distinguish the patterns for some
subsystems that are closely intertwined with other cellular
processes.

Figure 3 PR (Precision-Recall) curve to evaluate strategies for boosting classifier performance.
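
The idea behind supervised discretization, choosing cut points with reference to the class label, can be illustrated with a single entropy-minimizing split. The Python below is a simplified sketch: it finds only one cut point and omits the recursive splitting and stopping criterion that full methods apply, and the function names and toy values are assumptions made for the example rather than the settings used in this study.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * math.log2(p)
    return result


def best_cut_point(values, labels):
    """Return (cut, weighted entropy) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut, best_score


# Toy attribute: percentage of homolog hits in plant-associated genomes.
values = [10, 15, 20, 60, 70, 80]
labels = ["CF", "CF", "CF", "IF", "IF", "IF"]
cut, score = best_cut_point(values, labels)
print(cut, score)
```

An unsupervised method such as equal-width binning would place cut points without looking at the labels, so a boundary that separates the two classes cleanly could fall inside a single bin and be lost.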

The performance of machine learning schemes is 
statistically encouraging 

In this study, we employed several strategies to mitigate the
overfitting issues that can undermine supervised machine learning
tasks. Simply put, overfitting occurs when the predictive model learns
a pattern that is overly specific to the training data but not
generalized enough to perform equally well on unseen data [95]. We
strove to maximize inclusion of relevant attributes to mitigate
overfitting and increase model replicability [96], while excluding
unimportant attributes that may be detrimental to the performance of
pattern recognition schemes. Additionally, we held out pristine
examples for testing, integrated results over multiple classifiers
while retaining only predictions that showed a high degree of
consensus, chose classifier parameters based on the cross-validation
tests, and used a simpler predictor where possible.

Table 4 Statistics for positive class object prediction and parameters used in selected learning schemes for both
Dickeya dadantii 3937 and Pectobacterium carotovorum WPP14

Classifiers | Precision | TPR/recall/sensitivity | Specificity/TNR | Accuracy | F-measure | AUC
Dd3937
Random Forest | 0.93 | 0.81 | 0.98 | 0.94 | 0.87 | 0.97
Bayesian Network | 0.91 | 0.85 | 0.97 | 0.94 | 0.88 | 0.97
SMO using RBF kernels | 0.93 | 0.85 | 0.98 | 0.95 | 0.89 | 0.92
SMO using polynomial kernels | 0.91 | 0.87 | 0.97 | 0.95 | 0.95 | 0.89
Adaptive Boosting (Naive Bayes)* | 0.84 | 0.89 | 0.95 | 0.93 | 0.87 | 0.96
Adaptive Boosting (Decision Tree)* | 0.96 | 0.91 | 0.99 | 0.97 | 0.93 | 0.98
Adaptive Boosting (IBK)* | 0.96 | 0.84 | 0.99 | 0.95 | 0.90 | 0.99
Adaptive Boosting (Decision Stump)* | 0.92 | 0.87 | 0.98 | 0.95 | 0.89 | 0.97
Multi-Boosting (Decision Tree)* | 0.97 | 0.91 | 0.99 | 0.97 | 0.94 | 0.98
Multi-Boosting (IBK)* | 0.91 | 0.77 | 0.98 | 0.93 | 0.84 | 0.93
Multi-Boosting (Naive Bayes)* | 0.90 | 0.91 | 0.97 | 0.95 | 0.91 | 0.96
Logit-Boosting (Decision Stump)* | 0.91 | 0.90 | 0.97 | 0.96 | 0.91 | 0.98
WPP14
Random Forest | 0.89 | 0.81 | 0.97 | 0.93 | 0.85 | 0.97
Bayesian Network | 0.90 | 0.83 | 0.97 | 0.94 | 0.87 | 0.97
SMO using RBF kernels | 0.94 | 0.84 | 0.98 | 0.95 | 0.89 | 0.91
SMO using polynomial kernels | 0.93 | 0.86 | 0.98 | 0.95 | 0.95 | 0.89
Adaptive Boosting (Naive Bayes)* | 0.89 | 0.89 | 0.97 | 0.95 | 0.89 | 0.96
Adaptive Boosting (Decision Tree)* | 0.95 | 0.86 | 0.99 | 0.96 | 0.90 | 0.98
Adaptive Boosting (IBK)* | 0.87 | 0.83 | 0.96 | 0.93 | 0.85 | 0.92
Logit-Boosting (Decision Stump)* | 0.90 | 0.85 | 0.97 | 0.94 | 0.88 | 0.97
Multi-Boosting (Decision Tree)* | 0.94 | 0.86 | 0.98 | 0.96 | 0.90 | 0.98
Multi-Boosting (Decision Stump)* | 0.91 | 0.75 | 0.98 | 0.93 | 0.82 | 0.97
Multi-Boosting (Naive Bayes)* | 0.90 | 0.89 | 0.97 | 0.95 | 0.89 | 0.96
Logit-Boosting (Decision Stump)* | 0.90 | 0.87 | 0.97 | 0.95 | 0.89 | 0.97

*: denotes ensemble classifiers, with the base learner shown in parentheses.

Abbr: SMO: Support Vector Machine using Sequential Minimal Optimization; IBK: instance-based learner with K-nearest neighbor classifier; RBF: Radial Basis
Function kernels.
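
The precision, recall, specificity, accuracy and F-measure columns of Table 4 follow the standard confusion-matrix definitions. The following is a minimal sketch of those definitions, not the code used in this study; the function name `binary_metrics` and the counts passed to it are hypothetical and chosen only for illustration. AUC is not included because it cannot be derived from a single set of counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute positive-class statistics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives for the positive class.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called TPR or sensitivity
    specificity = tn / (tn + fp)       # also called TNR
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only (not data from this study).
example = binary_metrics(tp=81, fp=6, tn=294, fn=19)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Because the F-measure is the harmonic mean of precision and recall, it always lies between the two, which gives a quick consistency check on any row of such a table.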

Overall, the results of using supervised machine learning schemes for
host-microbe interaction factor prediction are statistically
encouraging, achieving a precision rate over 84% and a recall rate
over 75% in 10-fold cross-validation evaluation. We used a nested
10-fold cross-validation that includes an "outer" 10-round
cross-validation, which averages data variability over 10 different
data partitions. Each data partition sets aside 10% of the data set
(outer test set) to measure the performance of the predictive model
generated from the other 90% of the data (outer training set). Each
outer training set is used to choose the values of the tuning
parameters for this model in order to achieve optimal performance. The
parameter-tuning step is especially important for SVM and K-nearest
neighbor learning schemes, which are particularly sensitive to
parameter settings (Stone 1977). Performance statistics for different
classifiers are listed in Table 4, excluding classifiers with
precision rates < 80%. ROC curves of selected classifiers for WPP14
are shown in Figure 4.

The comparison of base classifier performances indicates that SVM and
random forest outperform other base classifiers (data not shown), and
ensemble classifiers generally perform better than base classifiers,
especially the boosting algorithms using decision trees as the base
learner. The ensemble classifiers integrate results over multiple
classifiers in order to average out the "classifier effect". For
example, some classifiers such as Naive Bayes can be overly optimistic
with a lower precision rate [97], and adaptive boosting ensemble
classifiers with Naive Bayes as the base learner can optimize
precision and total accuracy rate through incrementally iterative
learning processes [98]. The performance curves of selected
classifiers are shown in Figure 4a and 4b for Dd3937 and WPP14,
respectively. The best performing classifier for Dd3937 is the
adaptive boosting ensemble classifier [70] with decision trees as the
base learner, which achieved a precision rate above 97% with over 87%
recall rate. The best performing classifier for WPP14 is the
multi-boosting ensemble classifier [69] with decision trees as the
base learner, which reached a precision above 94% with over 82% recall
rate. Using the constructed predictive models from selected
classifiers, we are able to make predictions for data points with
previously unknown relation to host-microbe interactions.

Figure 4 Comparison of the selected learning schemes. (a) ROC curve for Dickeya dadantii 3937, (b) ROC curve for Pectobacterium
carotovorum WPP14. Panel title: "ROC curve: Comparison of selected classifiers with the most superior performance". (TPR: True Positive Rate; FPR: False Positive Rate).

Table 5 Top 50 predicted host-microbe interaction factors from Dickeya dadantii 3937

FeatureID | Prob | Name | Annotation
ABF-0018715 | 0.922 | virB8 | Inner membrane protein forms channel for type IV secretion of T-DNA complex (VirB8)
ABF-0020188 | 0.922 |  | Predicted cell-wall-anchored protein SasA (LPXTG motif); this is up-regulated by hrpY; we have a mutation in this gene.
ABF-0019950 | 0.922 |  | Putative multicopper oxidase
ABF-0019360 | 0.922 |  | hypothetical protein
ABF-0019151 | 0.922 |  | chrysobactin synthetase cbsF
ABF-0019124 | 0.922 |  | Biopolymer transport protein ExbD/TolR
ABF-0019122 | 0.922 |  | MotA/TolQ/ExbB proton channel family protein
ABF-0019117 | 0.922 | sftP | TonB-dependent receptor
ABF-0019116 | 0.922 |  | hypothetical protein
ABF-0018783 | 0.922 |  | putative transmembrane protein
ABF-0018775 | 0.922 |  | Holin
ABF-0018724 | 0.922 |  | putative ATP/GTP-binding protein remnant
ABF-0018722 | 0.922 | virB2 | Major pilus subunit of type IV secretion complex (VirB2)
ABF-0047137 | 0.922 |  | hypothetical protein
ABF-0018717 | 0.922 | virB6 | Integral inner membrane protein of type IV secretion complex (VirB6)
ABF-0018716 | 0.922 | virB7 | TriF protein
ABF-0018713 | 0.922 | virB10 | Inner membrane protein forms channel for type IV secretion of T-DNA complex (VirB10)
ABF-0018712 | 0.922 | virB11 | ATPase provides energy for both assembly of type IV secretion complex and secretion of T-DNA complex (VirB11)
ABF-0018601 | 0.922 |  | hypothetical protein
ABF-0018207 | 0.922 |  | hypothetical protein
ABF-0018199 | 0.922 | ganC | putative truncated PTS system EIIBC component
ABF-0017777 | 0.922 | hecA2 | Putative member of ShlA/HecA/FhaA exoprotein family
ABF-0015606 | 0.922 |  | ABC transporter permease protein
ABF-0015604 | 0.922 |  | Amino acid ABC transporter, periplasmic amino acid-binding protein
ABF-0015543 | 0.922 |  | hypothetical protein; 15544 is up-regulated by hrpY. Is 15543 in the same operon? We have a mutation in 15544
ABF-0015387 | 0.922 | nipE | necrosis-inducing protein
ABF-0014838 | 0.922 |  | putative exported protein
ABF-0014623 | 0.922 |  | Type IV pilus biogenesis protein PilN
ABF-0018720 | 0.922 | virB4 | ATPase provides energy for both assembly of type IV secretion complex and secretion of T-DNA complex (VirB4)
ABF-0018714 | 0.922 | virB9 | VirB9
ABF-0047204 | 0.922 |  | hypothetical protein
ABF-0015913 | 0.921 | ppdA | Prepilin peptidase dependent protein A
ABF-0017252 | 0.921 |  | Conjugative transfer protein TrbG
ABF-0018195 | 0.921 | ganG | galactan ABC transport system, permease component
ABF-0016407 | 0.921 |  | hypothetical protein
ABF-0018205 | 0.921 |  | Pirin
ABF-0019418 | 0.921 |  | Cellulose 1,4-beta-cellobiosidase precursor
[illegible] | 0.921 |  | hypothetical protein
[illegible] | 0.921 |  | hypothetical protein
[illegible] | 0.921 |  | Iron utilization protein
[illegible] | 0.921 | sttG | General secretion pathway protein G
[illegible] | 0.921 |  | hypothetical protein
ABF-0015381 | 0.921 | avrM | Avirulence protein
ABF-0018723 | 0.921 | virB1 | VirB1
ABF-0015598 | 0.921 |  | hypothetical protein
ABF-0015609 | 0.921 |  | Branched-chain amino acid aminotransferase
ABF-0018193 | 0.921 | ganF | galactan ABC transport system, permease component
ABF-0017097 | 0.921 |  | Methyl-accepting chemotaxis protein
ABF-0020433 | 0.921 |  | hypothetical protein
ABF-0019153 | 0.921 | cbsH | chrysobactin oligopeptidase CbsH

A significantly extended list of host-microbe interaction
factors is revealed

Application of different learned classifiers to the target genomes as
a whole allows us to generate a conservative set of predictions for
downstream experimentation. We pay the most attention to precision to
ensure that the retrieved data points are most relevant to
host-microbe interaction, which facilitates subsequent experimental
validation. In order to call a gene a "predicted host-interaction
factor", we required strict consensus across the different classifiers
with an average precision score in excess of thresholds defined by the
ROC curves (92% and 89% for Dd3937 and WPP14, respectively). The
selected classifiers generally agree with each other, and about two
thirds of all unknown genes are unanimously predicted by all
classifiers to be either host-microbe interaction factors or genes
involved in core biological processes. Using these criteria, a total
of 1726 genes (57.7% of Dd3937 genes) in Dd3937 and 2180 genes (67.2%
of WPP14 genes) in WPP14 are predicted not to be involved in
host-microbe interactions. There are 211 genes (7.1% of Dd3937 genes)
in Dd3937 and 216 genes (6.7% of WPP14 genes) in WPP14 classified as
putative interaction factors. The remaining 1052 genes (35.1% of
Dd3937 genes) and 850 genes (26.2% of WPP14 genes) are left
unclassified. The top 50 predicted host-microbe interaction factors
for Dickeya dadantii 3937 and Pectobacterium carotovorum WPP14 are
listed in Tables 5, 6 and 7, and the entire lists of predicted
host-microbe interaction factors for both strains are in Additional
file 7a and 7b. These lists partially overlap, with 56 orthologs
identified as interaction factors in both organisms. Given the
phylogenetic relationship between these two phytopathogens and the
similarity of their pathogenic phenotypes, we did expect this result;
however, the learning tasks were executed independently and agreement
across organisms was not a given.
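
The consensus rule described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes that each classifier returns a positive-class score between 0 and 1 for every gene, that a score of at least 0.5 counts as a positive vote, and that the per-gene average score is compared against the organism-specific threshold (0.92 for Dd3937 in the text). The function name `consensus_call`, the vote cutoff, the gene names and the scores are all hypothetical.

```python
from statistics import mean


def consensus_call(scores, threshold, vote_cutoff=0.5):
    """Assign one of three labels to a gene from its per-classifier scores.

    scores      -- positive-class scores, one per classifier (assumed 0..1)
    threshold   -- minimum average score for a positive call
    vote_cutoff -- score at which a single classifier votes positive
                   (an assumption of this sketch, not taken from the paper)
    """
    votes = [score >= vote_cutoff for score in scores]
    if all(votes) and mean(scores) >= threshold:
        return "interaction factor"
    if not any(votes):
        return "core process"
    # Classifiers disagree, or agree without reaching the threshold.
    return "unclassified"


# Hypothetical scores from three classifiers (not data from this study).
predictions = {
    "gene_A": [0.95, 0.93, 0.97],   # unanimous and above threshold
    "gene_B": [0.10, 0.05, 0.20],   # unanimous negative
    "gene_C": [0.90, 0.40, 0.85],   # classifiers disagree
    "gene_D": [0.80, 0.75, 0.90],   # unanimous but below threshold
}
calls = {gene: consensus_call(s, threshold=0.92) for gene, s in predictions.items()}
for gene, call in calls.items():
    print(gene, call)
```

Requiring unanimity for both classes, and leaving everything else unclassified, trades recall for precision, which matches the stated priority of producing a conservative list for experimental follow-up.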

One striking observation is the large number of genes of unknown
function in the predicted list of host-microbe interaction factors.
Among all predicted interaction factors, over 30% currently have
little or no annotated information, and many of them are ORFans
[99-102] without any homolog in the 297 bacterial genomes inspected.
Among the 56 genes found in the interaction factor lists for both
strains, roughly one third have no clear functional assignment.
Thirteen hypothetical proteins in both strain lists are "unknown
unknowns", a term used to indicate that no information at all is
available for that gene [103]. The other 9 are so-called "known
unknown" proteins, meaning that they are annotated only in general
biological terms, such as putative exported protein, putative
transmembrane protein, and probable lipoprotein. This result suggests
that a substantial portion of the genome cannot be screened using
conventional similarity-based searches, and that our more
sophisticated pattern recognition approach was able to identify
candidate interaction factors that would be missed using
homology-based methods.

Table 6 Top 50 predicted host-microbe interaction factors
from Pectobacterium carotovorum WPP14

ID | Prob | Name | Product
ADT-0001591 | 0.912 |  | hypothetical protein
ADT-0003750 | 0.912 |  | putative exported protein
ADT-0000805 | 0.912 | dltB | peptidoglycan biosynthesis protein
ADT-0003928 | 0.911 |  | pectate lyase
ADT-0003247 | 0.911 |  | methyl-accepting chemotaxis protein
ADT-0000806 | 0.911 | dltD | poly(glycerophosphate chain) D-alanine transfer protein
ADT-0003745 | 0.911 |  | ABC transporter ATP binding protein
ADT-0002063 | 0.911 |  | hypothetical protein
ADT-0000400 | 0.911 | hasE | HlyD family secretion protein
ADT-0003089 | 0.911 |  | N-terminal fragment of a diguanylate cyclase (pseudogene)
ADT-0003418 | 0.910 |  | methyl-accepting chemotaxis protein
ADT-0000941 | 0.910 |  | methyl-accepting chemotaxis protein
ADT-0006368 | 0.910 |  | hypothetical protein
ADT-0005582 | 0.910 |  | hypothetical protein
ADT-0000983 | 0.910 |  | methyl-accepting chemotaxis protein
ADT-0001252 | 0.910 |  | ABC transporter permease protein
ADT-0003245 | 0.910 |  | methyl-accepting chemotaxis protein
ADT-0000027 | 0.910 |  | methyl-accepting chemotaxis protein
ADT-0003542 | 0.909 |  | putative type IV pilus protein
ADT-0001195 | 0.909 |  | LysR-family transcriptional regulator
ADT-0004315 | 0.909 | astB | sulfate ester ABC transporter permease protein
ADT-0003152 | 0.909 |  | methyl-accepting chemotaxis protein
ADT-0002357 | 0.908 |  | methyl-accepting chemotaxis protein
ADT-0000543 | 0.908 |  | ABC transporter, substrate binding protein
ADT-0001392 | 0.908 |  | putative exported protein
ADT-0002087 | 0.908 |  | putative signaling protein
ADT-0001868 | 0.908 |  | LysR-family transcriptional regulator
ADT-0000803 | 0.908 |  | acyl carrier protein
ADT-0000571 | 0.908 |  | putative cellulase
ADT-0000535 | 0.908 |  | putative lipoprotein
ADT-0001404 | 0.907 |  | hypothetical protein
ADT-0004320 | 0.907 | sftP | TonB-dependent receptor
ADT-0001744 | 0.907 |  | putative exported protein
ADT-0003391 | 0.907 |  | putative membrane protein
ADT-0003535 | 0.907 |  | hypothetical protein
ADT-0003563 | 0.907 |  | LysR-family transcriptional regulator
ADT-0001980 | 0.907 |  | hypothetical protein
ADT-0000804 | 0.907 | dltA | putative D-alanine-poly(phosphoribitol) ligase subunit 1
ADT-0001616 | [illegible] |  | putative transport system membrane protein
ADT-0001394 | [illegible] |  | hypothetical protein
ADT-0001320 | 0.906 |  | methyl-accepting chemotaxis protein
ADT-0001567 | 0.906 |  | putative exported protein
ADT-0005614 | 0.906 |  | hypothetical protein
ADT-0001436 | 0.906 |  | putative component of polysulfide reductase
ADT-0004253 | 0.906 | occQ | octopine transport system permease protein
ADT-0001493 | 0.905 |  | hypothetical protein
ADT-0001492 | 0.905 |  | putative lipoprotein
ADT-0002704 | 0.905 |  | putative lipoprotein
ADT-0002584 | 0.905 |  | ABC transporter, membrane spanning protein

The remaining two-thirds of predicted interaction factors are
annotated with various (at least partially) informative functions. The
lists include genes with previously characterized roles in
host-microbe interaction in these or very closely related organisms
that were overlooked by the human experts who assembled the training
set. For example, Dd3937 secretes plant cell wall degrading enzymes
through a type II secretion system to degrade the plant host cell
wall, in turn using the released nutrients as carbon sources for
growth [104], and a group of genes related to this process are
predicted with high confidence, including predicted proteins
previously reported to play an accessory role in utilization of
galactose, a major component of pectin, in Dd3937 [105]. A knockout
mutant of a necrosis-inducing protein included in the prediction list
has been experimentally shown to have reduced virulence in a
Pectobacterium strain [106]. Further, our lists also include genes
with homologs implicated in host-microbe interaction in more distantly
related organisms. There are 9 genes that were shown with direct or
indirect evidence to be involved with metal homeostasis in different
bacteria, including exbB, exbD, and tonB genes, which are essential
for ferric iron uptake in Escherichia coli [107], Xanthomonas
campestris [108], Pseudomonas putida [109], and Photorhabdus temperata
[110], as well as ferric siderophore transporter and ferrichrome-iron
receptor genes, and a cytochrome b gene (cybC) that is positively
regulated by Fur and others that encode iron-dependent proteins in
Salmonella enterica [111]. The predicted lists also include orthologs
of the dltB gene implicated in cell surface adhesion in Staphylococcus
aureus [112], the srfA gene that encodes a secreted effector protein
in Pantoea ananatis [113], a LysR-family regulator associated with
quorum sensing in Pseudomonas aeruginosa [114], the cell-wall-anchored
protein SasA suggested to play a role in adhesion to the host in
Staphylococcus aureus [115], and the ppdC gene involved in
extracellular secretion machinery in Pseudomonas aeruginosa [116].
Additionally, we observed many predicted interaction factors that are
physically clustered together on the chromosome. For instance, our
prediction list includes an 11-gene cluster for a general secretion
system, and a 12-gene cluster that may be associated with type IV
secretion complex formation. This result agrees with previous studies
showing that many virulence properties of microbes are a collaborative
effort of multiple genes and that their physical clustering (and/or
co-expression as operons) is under functional and evolutionary
constraints [117,118].

Table 7 List of 56 genes predicted to be host-microbe interaction factors in both Dickeya dadantii 3937 and Pectobacterium
carotovorum WPP14

Dd3937 FeatureID | Dd3937 Name | Dd3937 Product | WPP14 FeatureID | WPP14 Name | WPP14 Product
ABF-0019117 | sftP | TonB-dependent receptor | ADT-0004320 | sftP | TonB-dependent receptor
ABF-0019116 |  | hypothetical protein | ADT-0004318 |  | unknown
ABF-0018207 |  | hypothetical protein | ADT-0001980 |  | hypothetical protein
ABF-0015604 |  | Amino acid ABC transporter | ADT-0000748 |  | putative extracellular solute-binding protein
ABF-0015387 | nipE | necrosis-inducing protein | ADT-0000781 |  | putative exported protein
ABF-0014838 |  | putative exported protein | ADT-0002655 |  | putative exported protein
ABF-0019124 |  | Biopolymer transport protein ExbD/TolR | ADT-0002263 |  | putative biopolymer transport protein
ABF-0019115 |  | hypothetical protein | ADT-0002265 |  | hypothetical protein
ABF-0017097 |  | Methyl-accepting chemotaxis protein | ADT-0003418 |  | methyl-accepting chemotaxis protein
ABF-0019566 |  | hypothetical protein | ADT-0001832 |  | putative exported protein
ABF-0016407 |  | hypothetical protein | ADT-0001404 |  | hypothetical protein
ABF-0015906 |  | 6-phosphogluconolactonase | ADT-0003106 |  | putative exported protein
ABF-0019118 | atsR | Alkanesulfonates-binding protein | ADT-0001174 | atsR | putative sulfate ester binding protein
ABF-0019125 | astB | Alkanesulfonates transport system permease protein | ADT-0004315 | astB | sulfate ester ABC transporter permease protein
ABF-0017125 | inh | Alkaline proteinase inhibitor precursor | ADT-0001911 | inh | protease inhibitor
ABF-0019002 |  | hypothetical protein | ADT-0001744 |  | putative exported protein
ABF-0019205 |  | ABC transporter | ADT-0002584 |  | ABC transporter
ABF-0014642 |  | hypothetical protein | ADT-0000571 |  | putative cellulase
ABF-0019092 |  | Transcriptional activator protein lysR | ADT-0001195 |  | LysR-family transcriptional regulator
ABF-0016585 |  | Methyl-accepting chemotaxis protein | ADT-0001320 |  | methyl-accepting chemotaxis protein
ABF-0019383 |  | D-alanyl transfer protein DltB | ADT-0000805 | dltB | peptidoglycan biosynthesis protein
ABF-0019855 |  | Methyl-accepting chemotaxis protein II (aspartate chemoreceptor protein) | ADT-0001887 |  | putative methyl-accepting chemotaxis protein
ABF-0015168 | chmX | Methyl-accepting chemotaxis protein III (ribose and galactose chemoreceptor protein) | ADT-0003152 |  | methyl-accepting chemotaxis protein
ABF-0018737 |  | DNA-binding protein | ADT-0003335 |  | putative regulatory protein
ABF-0019933 |  | hypothetical protein | ADT-0003354 |  | hypothetical protein
ABF-0014645 |  | Paraquat-inducible protein A | ADT-0002701 |  | putative membrane protein
ABF-0017674 |  | Methyl-accepting chemotaxis protein | ADT-0003245 |  | methyl-accepting chemotaxis protein
ABF-0020681 |  | hypothetical protein | ADT-0002418 |  | RES domain-containing protein
ABF-0015907 |  | TonB-dependent hemin | ADT-0002398 |  | TonB-dependent hemin
ABF-0018934 |  | 4-aminobutyrate aminotransferase | ADT-0002845 |  | putative class-III aminotransferase
ABF-0014824 |  | Methyl-accepting chemotaxis protein II (aspartate chemoreceptor protein) | ADT-0002104 |  | methyl-accepting chemotaxis protein
ABF-0018178 |  | Iron(III) dicitrate-binding protein | ADT-0002009 |  | putative periplasmic substrate-binding transport protein
ABF-0019391 |  | Pectate lyase | ADT-0003928 |  | pectate lyase
ABF-0015887 |  | hypothetical protein | ADT-0002063 |  | hypothetical protein
ABF-0016115 |  | Methyl-accepting chemotaxis protein | ADT-0000027 |  | methyl-accepting chemotaxis protein
ABF-0019101 | atsB | Alkanesulfonates transport system permease protein | ADT-0003749 | atsB | putative sulfate ester transporter
ABF-0019214 |  | Glucosamine kinase GpsK | ADT-0003604 |  | hypothetical protein
ABF-0016752 |  | Ferric siderophore transport system | ADT-0003559 |  | TonB-like protein
ABF-0016218 |  | Fosmidomycin resistance protein | ADT-0001196 |  | MFS efflux transporter
ABF-0046571 |  | Putative DNA-binding transcriptional regulatory family of the TetR family | ADT-0003719 |  | TetR-family transcriptional regulator
ABF-0014644 |  | Probable lipoprotein | ADT-0000406 |  | putative lipoprotein
ABF-0015918 | ppdC | Putative prepilin peptidase dependent protein | ADT-0002557 | ppdC | putative prepilin peptidase dependent protein C precursor
ABF-0018572 |  | ABC transporter | ADT-0001164 |  | putative iron (III) ABC transporter
ABF-0017527 |  | Lysophospholipase | ADT-0001494 |  | putative lipoprotein
ABF-0047106 |  | putative lipoprotein | ADT-0002704 |  | putative lipoprotein
ABF-0016810 |  | Drug resistance transporter | ADT-0001435 |  | putative membrane protein
ABF-0019088 |  | Dihydrodipicolinate synthase | ADT-0002292 |  | putative dihydrodipicolinate synthetase
ABF-0014868 |  | Ferrichrome-iron receptor | ADT-0004187 |  | TonB dependent receptor
ABF-0017095 |  | hypothetical protein | ADT-0000555 |  | putative exported protein
ABF-0018540 |  | Oxidoreductase | ADT-0000962 |  | probable short-chain dehydrogenase
ABF-0014948 |  | hypothetical protein | ADT-0002252 |  | putative exported protein
ABF-0020431 |  | Methyl-accepting chemotaxis protein I (serine chemoreceptor protein) | ADT-0000661 |  | methyl-accepting chemotaxis protein
ABF-0019851 |  | Methyl-accepting chemotaxis protein III (ribose and galactose chemoreceptor protein) | ADT-0001602 |  | methyl-accepting chemotaxis protein
ABF-0020368 |  | hypothetical protein | ADT-0002020 |  | putative exported protein
ABF-0016058 |  | Poly(glycerophosphate chain) D-alanine transfer protein DltD | ADT-0000806 | dltD | poly(glycerophosphate chain) D-alanine transfer protein
ABF-0019212 |  | N-Acetyl-D-glucosamine ABC transport system | ADT-0002138 |  | extracellular solute-binding protein

Interestingly, our predicted host-microbe interaction factor lists
include at least 17 chemotaxis- or motility-associated proteins for
each organism, including putative methyl-accepting chemotaxis
receptors and one type IV pilus biogenesis protein involved in
bacterial motility and adhesion to a solid surface [119]. Previous
studies have indicated that chemotactic responses with specific
cellular localization are critical for biofilm formation and
interaction with hosts in a variety of pathogenic bacteria [120-124].
The hypergeometric distribution was used to assess the statistical
significance of enrichment of a given functional group in the target
list relative to the genome as a whole [125,126]. InterPro family
annotations were uniformly assigned across both genomes, and we
conducted enrichment tests based on assignment to the InterPro
chemotaxis family. The highly significant p-values for both Dd3937
(p = 3.42e-11) and WPP14 (p = 3.36e-12) strongly suggest that
methyl-accepting chemotaxis genes are highly enriched among the
predicted host-microbe interaction factors.
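
A hypergeometric enrichment test of this kind can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the procedure used in this study: the function name `hypergeom_enrichment_p` is hypothetical, and the genome size and chemotaxis family size in the example are placeholder values, so the resulting p-value is not expected to reproduce the values reported above. Only the list size (211) and the count of 17 are taken from the text.

```python
from math import comb


def hypergeom_enrichment_p(population, family_size, list_size, observed):
    """Upper-tail hypergeometric p-value P(X >= observed).

    population  -- total number of genes in the genome
    family_size -- genes in the genome that belong to the functional group
    list_size   -- number of genes in the predicted list
    observed    -- genes in the predicted list that belong to the group
    """
    upper = min(family_size, list_size)
    # Number of ways to draw a list with exactly i group members,
    # summed over i = observed .. upper, using exact integer arithmetic.
    tail = sum(
        comb(family_size, i) * comb(population - family_size, list_size - i)
        for i in range(observed, upper + 1)
    )
    return tail / comb(population, list_size)


# Placeholder genome size and family size (assumptions, not from this study);
# list size 211 and 17 observed genes follow the text above.
p_value = hypergeom_enrichment_p(
    population=3000, family_size=40, list_size=211, observed=17
)
print(f"enrichment p-value: {p_value:.3g}")
```

The exact integer sums avoid the rounding problems that arise when very small tail probabilities are accumulated in floating point; the single division at the end converts the result to a float.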

Our learning strategy was explicitly designed to separate genes likely
to be involved in host-microbe interaction from genes involved with
core biological processes. The evidence above strongly suggests that
the method is effective at recognizing host-microbe interaction
factors, but it is important to keep in mind that it does not directly
address the possibility that some genes associated with core
biological processes may also contribute to interaction with hosts.
Direct experimental testing of a relatively large number of genes from
both the positive and negative classes is underway and will illuminate
the power of this machine learning approach to guide discovery.

Conclusion 

Although bacterial pathogen genome sequencing has become routine, the
large number of unknown genes has been, and still is, a major obstacle
to understanding the mechanisms of infection and adaptive evolution of
microbial pathogens overall. We successfully employed supervised
machine learning to identify candidate host interaction factors,
including genes of entirely unknown function, for two important
agricultural pathogens, Dickeya dadantii 3937 and Pectobacterium
carotovorum WPP14, achieving promising results with a precision rate
over 90% and a recall rate over 80%. The predictions made in this
study include many genes that have not previously been linked to
host-microbe interaction, a result not achievable with homology-based
search strategies, providing an expanded list of appealing targets for
further experimental validation. Our results indicate that the
learning schemes used in this study can recognize the complex patterns
of host-microbe interaction factors and yield biologically meaningful
results. Because supervised machine learning schemes are capable of
constructing powerful models, their future application to studying
additional complex biological processes is likely to be a productive
research approach.

Availability of supporting data 

The data sets supporting the results of this article are available in
the LabArchives repository,
[https://mynotebook.labarchives.com/share/plantpath/MjAuOHwyNTc20C8xNi9UcmVlTm9kZS8yNjQ4MTE0NTE0fDUyLjg=].

Additional file 6: (a) RP curve to compare strategies of data
transformation for boosting classifier performance, (b) PR curve to
compare classifier performance using five data sets partitioned according

Additional file 7: The list of predicted host-microbe interaction
factors, (a) List of 211 genes predicted to be host-microbe interaction
factors for Dickeya dadantii 3937, (b) list of 216 genes predicted to be
host-microbe interaction factors for Pectobacterium carotovorum

host-microbe interaction factors for both Dickeya dadantii 3937 and
Pectobacterium carotovorum WPP14.